consensus/grandpa: Fix high number of peer disconnects with invalid justification#9015

Merged
lexnv merged 22 commits into master from lexnv/debug-grandpa
Jul 22, 2025

Conversation

@lexnv
Contributor

@lexnv lexnv commented Jun 27, 2025

A grandpa race case has been identified in the versi-net stack around authority set changes. It unfolds as follows:

  • T0 / Node A: Completes round (15)
  • T1 / Node A: Applies new authority set change and increments the SetID (from 0 to 1)
  • T2 / Node B: Sends Precommit for round (15) with SetID (0) -- previous set ID
  • T3 / Node B: Applies new authority set change and increments the SetID (1)

In this scenario, Node B is not aware, at the moment it sends justifications, that the SetID has changed.
As a result, Node A cannot verify the signature of those justifications, because it checks them against a different SetID. This cascades through the sync engine, where Node B is wrongfully banned and disconnected.

This PR fixes the edge case by making grandpa signature verification also consider the prior SetID.
When the signature of a grandpa justification fails to verify against the current SetID, it is checked again against the prior SetID. If the prior SetID produces a valid signature, the outdated justification error is propagated through the code (i.e. SignatureResult::OutdatedSet).
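
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `toy_sign` is a hash-based stand-in for real signing, `check_signature` is a hypothetical helper, and the `Valid` and `Invalid` variant names are assumptions (only `OutdatedSet` is named in the description).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Outcome of a signature check. Only `OutdatedSet` is named in the PR
/// description; `Valid` and `Invalid` are assumed names for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum SignatureResult {
    Valid,
    OutdatedSet,
    Invalid,
}

/// Toy stand-in for signing: hashes (key, round, set_id, message).
/// It only models the fact that the set ID is part of the signed payload.
fn toy_sign(key: u64, round: u64, set_id: u64, message: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    round.hash(&mut hasher);
    set_id.hash(&mut hasher);
    message.hash(&mut hasher);
    hasher.finish()
}

/// Check against the current set ID first, then fall back to the prior one.
fn check_signature(
    key: u64,
    round: u64,
    set_id: u64,
    message: &[u8],
    signature: u64,
) -> SignatureResult {
    if toy_sign(key, round, set_id, message) == signature {
        return SignatureResult::Valid;
    }
    // Set 0 has no predecessor, so there is nothing to fall back to.
    match set_id.checked_sub(1) {
        Some(prev) if toy_sign(key, round, prev, message) == signature => {
            SignatureResult::OutdatedSet
        }
        _ => SignatureResult::Invalid,
    }
}
```

In the T0-T3 timeline, Node B's precommit signed under SetID 0 would be classified as `OutdatedSet` by a Node A that has already moved to SetID 1, rather than as plainly invalid.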

The sync engine treats outdated justifications as invalid, but does not ban the peer. This improves network stability during authority set changes, which previously caused frequent disconnects on versi-net.
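
As a rough sketch of that policy (all type and function names here are hypothetical and do not come from the sync engine), the three outcomes could map to peer handling like this:

```rust
/// Verdict on a received justification. Names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JustificationVerdict {
    Valid,
    OutdatedSet,
    Invalid,
}

/// What the sync layer does in response. Names are hypothetical.
#[derive(Debug, PartialEq, Eq)]
struct SyncDecision {
    /// Whether the justification is imported.
    import: bool,
    /// Whether the sending peer is banned and disconnected.
    ban_peer: bool,
}

fn decide(verdict: JustificationVerdict) -> SyncDecision {
    match verdict {
        JustificationVerdict::Valid => SyncDecision { import: true, ban_peer: false },
        // Outdated justifications are rejected, but the peer is kept.
        JustificationVerdict::OutdatedSet => SyncDecision { import: false, ban_peer: false },
        JustificationVerdict::Invalid => SyncDecision { import: false, ban_peer: true },
    }
}
```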

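Review Notes

  • Main changes that verify the prior SetId on failures are placed in check_message_signature_with_buffer
  • Sync engine no longer disconnects peers for outdated justifications in process_service_command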

Testing Done

  • Deployed the PR to versi-net with 40 validators
  • Previously, we noticed 10 of 40 validators disconnecting every 15-20 minutes, leading to instability
  • Over the past 24h the issue has been mitigated: https://grafana.teleport.parity.io/goto/FPNWlmsHR?orgId=1
  • Note: bootnodes 0 and 1 are currently running outdated versions that do not incorporate this SetID verification improvement

Closes: #8872
Closes: #1147

lexnv added 2 commits June 27, 2025 11:33
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv self-assigned this Jun 27, 2025
lexnv added 4 commits June 27, 2025 15:08
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@paritytech-review-bot paritytech-review-bot bot requested a review from a team June 27, 2025 15:41
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@paritytech-workflow-stopper
All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/15930559372
Failed job name: build-test-collators

@paritytech-workflow-stopper
All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/15930559371
Failed job name: build-rustdoc

@paritytech-workflow-stopper
All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/15930559372
Failed job name: build-malus

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv changed the title from "[dnm] consensus/grandpa: Add debug logs for set Ids and round Ids" to "consensus/grandpa: Fix high number of peer disconnects with invalid justification" Jun 30, 2025
@lexnv lexnv added the T0-node This PR/Issue is related to the topic “node”. label Jun 30, 2025
@lexnv
Contributor Author

lexnv commented Jun 30, 2025

/cmd prdoc --audience node_dev --bump patch minor

github-actions bot and others added 5 commits June 30, 2025 09:22
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv added 2 commits June 30, 2025 10:31
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv and others added 2 commits July 1, 2025 15:13
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Contributor

@skunert skunert left a comment

Nice!

log::debug!(target: log_target, "Bad signature on message from {:?}", id);
// Check if the signature is valid in the previous set.
let prev_set_id = set_id.checked_sub(1).unwrap_or(0);
if prev_set_id == set_id {

Could we not directly check for 0 here, for easier readability? In other cases this should not trigger.
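
A small sketch of what this suggestion amounts to. The rest of the function under review is not shown here, so both helpers below are hypothetical, and returning `None` when there is no prior set is an assumption about the intent:

```rust
/// Shape of the snippet under review: subtract, fall back to 0, then
/// compare with the original value to detect "no prior set".
fn prev_set_id_original(set_id: u64) -> Option<u64> {
    let prev_set_id = set_id.checked_sub(1).unwrap_or(0);
    if prev_set_id == set_id {
        None
    } else {
        Some(prev_set_id)
    }
}

/// Suggested shape: check for 0 directly.
fn prev_set_id_direct(set_id: u64) -> Option<u64> {
    if set_id == 0 {
        None
    } else {
        Some(set_id - 1)
    }
}
```

The two agree for every input: `prev_set_id == set_id` can only hold when `set_id` is 0, since for any other value the subtraction yields a strictly smaller number.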

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv enabled auto-merge July 21, 2025 10:08
@lexnv lexnv added this pull request to the merge queue Jul 21, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 21, 2025
@lexnv lexnv added this pull request to the merge queue Jul 22, 2025
Merged via the queue into master with commit 2f8d2a2 Jul 22, 2025
278 of 284 checks passed
@lexnv lexnv deleted the lexnv/debug-grandpa branch July 22, 2025 12:53
@github-project-automation github-project-automation bot moved this to Blocked ⛔️ in Networking Jul 22, 2025
ordian added a commit that referenced this pull request Jul 24, 2025
* master: (67 commits)
  Fix subsume_assets incorrectly merging two AssetsInHolding (#9179)
  Replace `log` with `tracing` on `pallet-bridge-grandpa` (#9294)
  [Staking Async] Saturating accrue era reward points (#9186)
  fix: skip verifying imported blocks (#9280)
  Ci-unified update (with solc and resolc) (#9289)
  Dedup dependencies between dependencies and dev-dependencies (#9233)
  network: Upgrade litep2p to v0.10.0 (#9287)
  consensus/grandpa: Fix high number of peer disconnects with invalid justification (#9015)
  Zombienet CI improvements (#9172)
  Rewrite old disputes test with zombienet-sdk (#9257)
  [revive] eth-decimals (#9101)
  Allow setting idle connection timeout value in custom node implementations (#9251)
  gossip-support: make low connectivity message an error (#9264)
  Rewrite validator disabling test with zombienet-sdk (#9128)
  Fix CandidateDescriptor debug logs (#9255)
  babe: keep stateless verification in `Verifier`, move everything else to the import queue (#9147)
  Allow locking to bump consumer without limits (#9176)
  feat(cumulus): Adds support for additional relay state keys in parachain validation data inherent (#9262)
  zombienet, make logs for para works (#9230)
  Remove `subwasmlib` (#9252)
  ...
EgorPopelyaev added a commit to EgorPopelyaev/polkadot-sdk that referenced this pull request Jul 25, 2025
* Don't use labels for branch names creation in the backport bot (paritytech#9243)

* Remove unused deps (paritytech#9235)

# Description

Remove unused deps using `cargo udeps`

Part of: paritytech#6906

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Branislav Kontur <bkontur@gmail.com>

* Fixed genesis config presets for bridge tests (paritytech#9185)

Closes: paritytech#9116

---------

Co-authored-by: Branislav Kontur <bkontur@gmail.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Karol Kokoszka <karol@parity.io>

* Remove `subwasmlib` (paritytech#9252)

This removes `subwasmlib` and replaces it with some custom code to fetch
the metadata. Main point of this change is the removal of some external
dependency.

Closes: paritytech#9203

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* zombienet, make logs for para works (paritytech#9230)

Fix to correctly display the logs (URLs) for paras.

* feat(cumulus): Adds support for additional relay state keys in parachain validation data inherent (paritytech#9262)

Adds the possibility for parachain clients to collect additional relay
state keys into the validation data inherent.

With this change, other consensus engines can collect additional relay
keys into the parachain inherent data:
```rs
let paras_inherent_data = ParachainInherentDataProvider::create_at(
  relay_parent,
  relay_client,
  validation_data,
  para_id,
  vec![
     relay_well_known_keys::EPOCH_INDEX.to_vec() // <----- Example
  ],
)
.await;
```

* Allow locking to bump consumer without limits (paritytech#9176)

Locking is a system-level operation, and can only increment the consumer
limit at most once. Therefore, it should use
`inc_consumer_without_limits`. This behavior is optional, and is only
used in the call path of `LockableCurrency`. Reserves, Holds and Freezes
(and other operations like transfer etc.) have the ability to return
`DispatchResult` and don't need this bypass. This is demonstrated in the
unit tests added.

Beyond this, this PR: 

* uses the correct way to get the account data in tests
* adds an `Unexpected` event instead of a silent `debug_assert!`. 
* Adds `try_state` checks for correctness of `account.frozen` invariant.

---------

Co-authored-by: Ankan <10196091+Ank4n@users.noreply.github.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* babe: keep stateless verification in `Verifier`, move everything else to the import queue (paritytech#9147)

We agreed to split paritytech#8446
into two PRs: one for BABE (this one) and one for AURA. This is the
easier one.

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Fix CandidateDescriptor debug logs (paritytech#9255)

Regardless of the descriptor version, the CandidateDescriptor was logged
as a CandidateDescriptorV2 instance.

To address this issue we now derive RuntimeDebug only when std is not
enabled so we can have that empty implementation that does not bloat the
runtime WASM. When std is enabled we implement core::fmt::Debug by hand
and print the
structure differently depending on the CandidateDescriptor version.

Fixes: paritytech#8457

---------

Signed-off-by: Alexandru Cihodaru <alexandru.cihodaru@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>

* Rewrite validator disabling test with zombienet-sdk (paritytech#9128)

Fixes paritytech#9085

---------

Signed-off-by: Alexandru Cihodaru <alexandru.cihodaru@parity.io>

* gossip-support: make low connectivity message an error (paritytech#9264)

All is not well when a validator is not properly connected. Examples of
things that might happen:
- Finality might be slightly delayed because the validator will be a
no-show when it can't retrieve PoVs to validate approval work:
paritytech#8915.
- When they author blocks, they won't back anything because gossiping of
backing statements happens over the grid topology, e.g. blocks authored
by validators with a low number of peers:

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc-polkadot.helixstreet.io#/explorer/query/26931262

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc-polkadot.helixstreet.io#/explorer/query/26931260

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot.api.onfinality.io%2Fpublic-ws#/explorer/query/26931334

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-public-rpc.blockops.network%2Fws#/explorer/query/26931314

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-public-rpc.blockops.network%2Fws#/explorer/query/26931292

https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fpolkadot-public-rpc.blockops.network%2Fws#/explorer/query/26931447


The problem is visible in the `polkadot_parachain_peer_count` metrics, but
it seems people are not monitoring them well enough, so let's make it more
visible that nodes with low connectivity are not working in good conditions.

I also reduced the threshold to 85%, so that we don't trigger this error
too eagerly.

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Allow setting idle connection timeout value in custom node implementations (paritytech#9251)

Allow setting idle connection timeout value. This can be helpful in
custom networks to allow maintaining long-lived connections.

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* [revive] eth-decimals (paritytech#9101)

On Ethereum, 1 ETH is represented as 10^18 wei (wei being the smallest
unit).
On Polkadot 1 DOT is defined as 10^10 plancks. It means that any value
smaller than 10^8 wei can not be expressed with the native balance. Any
contract that attempts to use such a value currently reverts with a
DecimalPrecisionLoss error.
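
The arithmetic behind this can be sketched as below. It is an illustration only: it assumes one 18-decimal unit maps to exactly 1 DOT, and `wei_to_planck` and the local `DecimalPrecisionLoss` type are hypothetical stand-ins rather than the pallet's API.

```rust
/// 1 ETH = 10^18 wei.
const WEI_PER_ETH: u128 = 10u128.pow(18);
/// 1 DOT = 10^10 plancks.
const PLANCK_PER_DOT: u128 = 10u128.pow(10);
/// Assuming 1 ETH-style unit corresponds to 1 DOT, one planck is 10^8 wei.
const WEI_PER_PLANCK: u128 = WEI_PER_ETH / PLANCK_PER_DOT;

/// Local stand-in for the error mentioned in the text.
#[derive(Debug, PartialEq, Eq)]
struct DecimalPrecisionLoss;

/// Convert wei to plancks, failing when the value is not a whole number
/// of plancks (the behaviour described as the current one).
fn wei_to_planck(wei: u128) -> Result<u128, DecimalPrecisionLoss> {
    if wei % WEI_PER_PLANCK != 0 {
        Err(DecimalPrecisionLoss)
    } else {
        Ok(wei / WEI_PER_PLANCK)
    }
}
```

For instance, 10^18 wei converts cleanly to 10^10 plancks, while a value of 1 wei has no planck representation and yields the error.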

In theory, RPC can define a decimal representation different from
Ethereum mainnet (10^18). In practice tools (frontend libraries,
wallets, and compilers) ignore it and expect 18 decimals.

The current behaviour breaks eth compatibility and needs to be updated.
See issue paritytech#109 for more details.


Fix  paritytech/contract-issues#109
[weights
compare](https://weights.tasty.limo/compare?unit=weight&ignore_errors=true&threshold=10&method=asymptotic&repo=polkadot-sdk&old=master&new=pg/eth-decimals&path_pattern=substrate/frame/**/src/weights.rs,polkadot/runtime/*/src/weights/**/*.rs,polkadot/bridges/modules/*/src/weights.rs,cumulus/**/weights/*.rs,cumulus/**/weights/xcm/*.rs,cumulus/**/src/weights.rs)

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Alexander Theißen <alex.theissen@me.com>
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>

* Rewrite old disputes test with zombienet-sdk (paritytech#9257)

Fixes: paritytech#9256

---------

Signed-off-by: Alexandru Cihodaru <alexandru.cihodaru@parity.io>

* Zombienet CI improvements (paritytech#9172)

## 🔄 Zombienet CI Refactor: Matrix-Based Workflows

This PR refactors the Zombienet CI workflows to use a **matrix-based
approach**, resulting in:

- ✅ **Easier test maintenance** – easily add or remove tests without
duplicating workflow logic.
- 🩹 **Improved flaky test handling** – flaky tests are excluded by
default but can be explicitly included by pattern.
- 🔍 **Pattern-based test selection** – run only tests matching a name
pattern, ideal for debugging.

---

## 🗂️ Structure Changes

- **Test definitions** are now stored in `.github/zombienet-tests/`.
- Each workflow (`Cumulus`, `Substrate`, `Polkadot`, `Parachain
Template`) has its own YAML file with test configurations.

---

## 🧰 Added Scripts

### `.github/scripts/parse-zombienet-tests.py`
- Parses test definitions and generates a GitHub Actions matrix.
- Filters out flaky tests by default.
- If a `test_pattern` is provided, matching tests are **included even if
flaky**.

### `.github/scripts/dispatch-zombienet-workflow.sh`
- Triggers a Zombienet workflow multiple times, optionally filtered by
test name pattern.
- Stores results in a **CSV file** for analysis.
- Useful for debugging flaky tests or stress-testing specific workflows.
- Intended to be run from the local machine.

---------

Co-authored-by: Javier Viola <363911+pepoviola@users.noreply.github.com>
Co-authored-by: Alexander Samusev <41779041+alvicsam@users.noreply.github.com>
Co-authored-by: Javier Viola <javier@parity.io>

* consensus/grandpa: Fix high number of peer disconnects with invalid justification (paritytech#9015)

A grandpa race case has been identified in the versi-net stack around
authority set changes. It unfolds as follows:

- T0 / Node A: Completes round (15)
- T1 / Node A: Applies new authority set change and increments the SetID
(from 0 to 1)
- T2 / Node B: Sends Precommit for round (15) with SetID (0) -- previous
set ID
- T3 / Node B: Applies new authority set change and increments the SetID
(1)

In this scenario, Node B is not aware at the moment of sending
justifications that the Set ID has changed.
The downstream effect is that Node A will not be able to verify the
signature of justifications, since a different SetID is taken into
account. This will cascade through the sync engine, where the Node B is
wrongfully banned and disconnected.

This PR fixes the edge case by making grandpa signature verification
also consider the prior SetID.
When the signature of a grandpa justification fails to verify against
the current SetID, it is checked again against the prior SetID. If the
prior SetID produces a valid signature, the outdated justification error
is propagated through the code (i.e. `SignatureResult::OutdatedSet`).

The sync engine will handle the outdated justifications as invalid, but
without banning the peer. This leads to increased stability of the
network during authority changes, which caused frequent disconnects to
versi-net in the past.

### Review Notes
- Main changes that verify prior SetId on failures are placed in
[check_message_signature_with_buffer](https://github.com/paritytech/polkadot-sdk/pull/9015/files#diff-359d7a46ea285177e5d86979f62f0f04baabf65d595c61bfe44b6fc01af70d89R458-R501)
- Sync engine no longer disconnects peers for outdated justifications in
[process_service_command](https://github.com/paritytech/polkadot-sdk/pull/9015/files#diff-9ab3391aa82ee2b2868ece610100f84502edcf40638dba9ed6953b6e572dfba5R678-R703)

### Testing Done
- Deployed the PR to versi-net with 40 validators
- Previously, we noticed 10 of 40 validators disconnecting every 15-20
minutes, leading to instability
- Over the past 24h the issue has been mitigated:
https://grafana.teleport.parity.io/goto/FPNWlmsHR?orgId=1
- Note: bootnodes 0 and 1 are currently running outdated versions that
do not incorporate this SetID verification improvement

Closes: paritytech#8872
Closes: paritytech#1147

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>

* network: Upgrade litep2p to v0.10.0 (paritytech#9287)

## litep2p v0.10.0

This release adds the ability to use system DNS resolver and change
Kademlia DNS memory store capacity. It also fixes the Bitswap protocol
implementation and correctly handles the dropped notification substreams
by unregistering them from the protocol list.

### Added

- kad: Expose memory store configuration
([paritytech#407](paritytech/litep2p#407))
- transport: Allow changing DNS resolver config
([paritytech#384](paritytech/litep2p#384))

### Fixed

- notification: Unregister dropped protocols
([paritytech#391](paritytech/litep2p#391))
- bitswap: Fix protocol implementation
([paritytech#402](paritytech/litep2p#402))
- transport-manager: stricter supported multiaddress check
([paritytech#403](paritytech/litep2p#403))

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Dedup dependencies between dependencies and dev-dependencies (paritytech#9233)

# Description

Deduplicate some dependencies between `dependencies` and
`dev-dependencies` sections

---------

Co-authored-by: Bastian Köcher <git@kchr.de>

* Ci-unified update (with solc and resolc) (paritytech#9289)

add `solc` and `resolc` binaries to image

```
$ solc --version
solc, the solidity compiler commandline interface
Version: 0.8.30+commit.73712a01.Linux.g++
$ resolc --version
Solidity frontend for the revive compiler version 0.3.0+commit.ed60869.llvm-18.1.8
```

You can update or install specific version with `/builds/download-bin.sh
<solc | resolc> [version | latest]`
e.g.
```
/builds/download-bin.sh solc v0.8.30
```

* fix: skip verifying imported blocks (paritytech#9280)

Closes paritytech#9277. Still WIP
testing

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* [Staking Async] Saturating accrue era reward points (paritytech#9186)

Replaces regular addition with saturating addition when accumulating era
reward points in `pallet-staking-async` to prevent potential overflow.
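
For illustration, the difference between the two kinds of addition looks like this. The `u32` type and the function names are assumptions for this sketch, not taken from the pallet:

```rust
/// Plain addition, made explicit with `checked_add`: yields `None` on
/// overflow (a bare `+` would panic in debug builds and wrap in release).
fn accrue_checked(current: u32, points: u32) -> Option<u32> {
    current.checked_add(points)
}

/// Saturating addition: clamps at `u32::MAX` instead of overflowing.
fn accrue_saturating(current: u32, points: u32) -> u32 {
    current.saturating_add(points)
}
```

Near the upper bound, `accrue_checked(u32::MAX - 1, 5)` reports an overflow, whereas `accrue_saturating(u32::MAX - 1, 5)` simply stays at `u32::MAX`.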

---------

Co-authored-by: Bastian Köcher <git@kchr.de>

* Replace `log` with `tracing` on `pallet-bridge-grandpa` (paritytech#9294)

This PR replaces `log` with `tracing` instrumentation on
`pallet-bridge-grandpa` by providing structured logging.

Partially addresses paritytech#9211

* Fix subsume_assets incorrectly merging two AssetsInHolding (paritytech#9179)

`subsume_assets` fails to correctly subsume two instances of
`AssetsInHolding` under certain conditions, which can result in loss of
funds (as assets are overridden rather than summed together).

E.g. consider the following test:
```
	#[test]
	fn subsume_assets_different_length_holdings() {
		let mut t1 = AssetsInHolding::new();
		t1.subsume(CFP(400));

		let mut t2 = AssetsInHolding::new();
		t2.subsume(CF(100));
		t2.subsume(CFP(100));

		t1.subsume_assets(t2);
```

current result (without this PR change):
```
		let mut iter = t1.into_assets_iter();
		assert_eq!(Some(CF(100)), iter.next());
		assert_eq!(Some(CFP(100)), iter.next());
```

expected result:
```
		let mut iter = t1.into_assets_iter();
		assert_eq!(Some(CF(100)), iter.next());
		assert_eq!(Some(CFP(500)), iter.next());
```

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Branislav Kontur <bkontur@gmail.com>

* yap-runtime: fixes for `GetParachainInfo` (paritytech#9312)

This fixes the YAP parachain runtimes in case you encounter a panic in
the collator similar to
paritytech/zombienet#2050:
```
Failed to retrieve the parachain id
```
(which we do have zombienet-sdk tests for
[here](https://github.com/paritytech/polkadot-sdk/blob/master/substrate/client/transaction-pool/tests/zombienet/yap_test.rs))

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* RecentDisputes/ActiveDisputes use BTreeMap instead of Vec (paritytech#9309)

Fixes paritytech#782

---------

Signed-off-by: Alexandru Cihodaru <alexandru.cihodaru@parity.io>

* network/litep2p: Switch to system DNS resolver (paritytech#9321)

Switch to system DNS resolver instead of 8.8.8.8 that litep2p uses by
default. This enables full administrator control of what upstream DNS
servers to use, including resolution of local names using custom DNS
servers.

Fixes paritytech#9298.

---------

Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* litep2p/discovery: Ensure non-global addresses are not reported as external (paritytech#9281)

This PR ensures that external addresses discovered by the identify
protocol are not propagated to the litep2p backend if they are not
global. This leads to a healthier DHT over time, since nodes will not
advertise loopback / non-global addresses.

We have seen various cases where loopback addresses were reported as
external:

```
2025-07-16 16:18:39.765 TRACE tokio-runtime-worker sub-libp2p::discovery: verify new external address: /ip4/127.0.0.1/tcp/30310/p2p/12D3KooWNw19ScMjzNGLnYYLQxWcM9EK9VYPbCq241araUGgbdLM    

2025-07-16 16:18:39.765  INFO tokio-runtime-worker sub-libp2p: 🔍 Discovered new external address for our node: /ip4/127.0.0.1/tcp/30310/p2p/12D3KooWNw19ScMjzNGLnYYLQxWcM9EK9VYPbCq241araUGgbdLM
```

This PR takes into account the network config for
`allow_non_global_addresses`.
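
A minimal sketch of such a filter, using only stable `std::net` predicates. The helper names are hypothetical and the "global" test is an approximation for illustration, not the check used by the litep2p backend:

```rust
use std::net::IpAddr;

/// Approximate "is this a global address" check built from stable std
/// predicates; real implementations cover more reserved ranges.
fn is_probably_global(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => !(v4.is_loopback()
            || v4.is_private()
            || v4.is_link_local()
            || v4.is_unspecified()
            || v4.is_broadcast()),
        IpAddr::V6(v6) => !(v6.is_loopback() || v6.is_unspecified()),
    }
}

/// Report an address as external only if it looks global, unless the
/// network config explicitly allows non-global addresses.
fn should_report_external(ip: IpAddr, allow_non_global_addresses: bool) -> bool {
    allow_non_global_addresses || is_probably_global(ip)
}
```

With `allow_non_global_addresses` set to `false`, the `127.0.0.1` address from the log lines above would be filtered out instead of being advertised.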

Closes: paritytech#9261

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* [Backport] Regular version bumps and prdoc reordering from the stable2506 release branch back to master (paritytech#9320)

This PR backports:
- NODE_VERSION bumps
- spec_version bumps
- prdoc reordering
from the release branch back to master

---------

Co-authored-by: ParityReleases <release-team@parity.io>

* add node version to the announcement message

* test in the internal room

---------

Signed-off-by: Alexandru Cihodaru <alexandru.cihodaru@parity.io>
Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Diego <diego2737@gmail.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Branislav Kontur <bkontur@gmail.com>
Co-authored-by: Anthony Kveder <32168055+antkve@users.noreply.github.com>
Co-authored-by: Karol Kokoszka <karol@parity.io>
Co-authored-by: Bastian Köcher <git@kchr.de>
Co-authored-by: Javier Viola <363911+pepoviola@users.noreply.github.com>
Co-authored-by: Rodrigo Quelhas <22591718+RomarQ@users.noreply.github.com>
Co-authored-by: Kian Paimani <5588131+kianenigma@users.noreply.github.com>
Co-authored-by: Ankan <10196091+Ank4n@users.noreply.github.com>
Co-authored-by: sistemd <enntheprogrammer@gmail.com>
Co-authored-by: Alexandru Cihodaru <40807189+AlexandruCihodaru@users.noreply.github.com>
Co-authored-by: Alexandru Gheorghe <49718502+alexggh@users.noreply.github.com>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
Co-authored-by: PG Herveou <pgherveou@gmail.com>
Co-authored-by: Alexander Theißen <alex.theissen@me.com>
Co-authored-by: Oliver Tale-Yazdi <oliver.tale-yazdi@parity.io>
Co-authored-by: Lukasz Rubaszewski <117115317+lrubasze@users.noreply.github.com>
Co-authored-by: Alexander Samusev <41779041+alvicsam@users.noreply.github.com>
Co-authored-by: Javier Viola <javier@parity.io>
Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Co-authored-by: Evgeny Snitko <evgeny@parity.io>
Co-authored-by: Raymond Cheung <178801527+raymondkfcheung@users.noreply.github.com>
Co-authored-by: ordian <4211399+ordian@users.noreply.github.com>
Co-authored-by: ParityReleases <release-team@parity.io>
tmpolaczyk pushed a commit to moondance-labs/polkadot-sdk that referenced this pull request Oct 14, 2025
alvicsam pushed a commit that referenced this pull request Oct 17, 2025
khssnv added a commit to QuantumFusion-network/qf-solochain that referenced this pull request Dec 9, 2025
tmpolaczyk pushed a commit to moondance-labs/polkadot-sdk that referenced this pull request Jan 15, 2026
Labels

T0-node This PR/Issue is related to the topic “node”.

Projects

Status: Blocked ⛔️

Development

Successfully merging this pull request may close these issues.

consensus/grandpa: High number of disconnects with invalid justification
sync: Invalid justification provided

6 participants